PROGRESS REPORT 4


OTU Analysis


Alpha diversity



Estimated and observed species richness

Figure x: Stacked barplots for species richness. The estimated richness (green bars) was calculated using the chao calculator and the observed richness (red bars) was calculated using sobs.
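To make the two calculators in the caption concrete, the following is a minimal Python sketch, not the code used to produce the figure. It assumes that sobs counts OTUs with at least one read and that chao refers to the bias-corrected Chao1 estimator; the function names and the `sample` counts are hypothetical.

```python
from collections import Counter


def sobs(counts):
    """Observed richness: number of OTUs with at least one read."""
    return sum(1 for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1 (assumed form): Sobs + n1*(n1-1) / (2*(n2+1))."""
    freq = Counter(c for c in counts if c > 0)
    n1 = freq[1]  # singletons: OTUs seen exactly once
    n2 = freq[2]  # doubletons: OTUs seen exactly twice
    return sobs(counts) + n1 * (n1 - 1) / (2 * (n2 + 1))


# Hypothetical OTU counts for one sample (illustration only).
sample = [10, 5, 2, 2, 1, 1, 1, 0]
print(sobs(sample), chao1(sample))
```

Under these assumptions the estimate can never fall below the observed richness, which is why the two values can be drawn as stacked bars.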


Boxplots, density plots and histograms

Figure x: Species richness (observed species) displayed by boxplot (A), density plots (B) and histograms (C).


Figure x: Correlation between species richness and sequencing depth. Observed species richness calculated using sobs (A) and estimated species richness calculated using the chao calculator (B).


Species Diversity


Figure x: Species diversity and correlation to species richness. Phylo-diversity (C) correlates well with species richness.


Species accumulation


The number of OTUs observed as a function of sampling effort was determined using species accumulation methods as described by Oksanen et al. (2018). We employed four methods (exact, random, collector and rarefaction) to investigate whether new species were still being added as sampling effort increased. The results can guide the decision to re-sample or to proceed to downstream analysis.
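Two of the four methods are simple enough to sketch directly. The Python below is an illustration only, not the vegan code used for the report: it assumes that "collector" adds samples in the order they appear in the table and that "random" averages over random sample orderings. The `abund` matrix (rows are samples, columns are OTUs) and the function names are hypothetical.

```python
import random


def collector_curve(matrix):
    """Cumulative number of OTUs when samples are added in the given order."""
    seen, curve = set(), []
    for row in matrix:
        seen.update(j for j, n in enumerate(row) if n > 0)
        curve.append(len(seen))
    return curve


def random_curve(matrix, permutations=100, seed=1):
    """Mean cumulative number of OTUs over random sample orderings."""
    rng = random.Random(seed)
    rows = list(matrix)
    totals = [0] * len(rows)
    for _ in range(permutations):
        rng.shuffle(rows)
        for i, s in enumerate(collector_curve(rows)):
            totals[i] += s
    return [t / permutations for t in totals]


# Hypothetical abundance table: 3 samples x 4 OTUs.
abund = [[3, 0, 1, 0],
         [0, 2, 1, 0],
         [1, 0, 0, 4]]
print(collector_curve(abund))
print(random_curve(abund))
```

A curve that is still rising at the last sample suggests that further sampling would add new OTUs; a curve that has flattened suggests the opposite.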





Rarefaction and Extrapolation (R/E)


Type 1: Sample-size-based R/E curve


Type 2: Sample completeness curve

Figure x: Species diversity estimates as a function of sample size. Only species with abundance greater than or equal to 1 are detected in the sample.


Type 3: Coverage-based R/E curve



Figure x: Rarefaction and extrapolation curves. Sample-size-based curve (A), sample completeness curve (B), coverage-based curves (C).
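The rarefaction half of these curves can be sketched in a few lines of Python. This is a hedged illustration, not the software used for the figure. It assumes classical individual-based rarefaction (expected richness in a random subsample of m reads) and uses Good's coverage as a simple stand-in for sample completeness; extrapolation beyond the observed sample size is not covered. The function names and the counts in the usage lines are hypothetical.

```python
from math import comb


def rarefy(counts, m):
    """Expected number of OTUs in a random subsample of m reads."""
    counts = [c for c in counts if c > 0]
    n = sum(counts)
    if not 0 <= m <= n:
        raise ValueError("m must lie between 0 and the total read count")
    total = comb(n, m)
    # An OTU with c reads is missed with probability C(n-c, m) / C(n, m).
    return sum(1 - comb(n - c, m) / total for c in counts)


def good_coverage(counts):
    """Good's coverage: 1 - (number of singletons / total reads)."""
    n = sum(counts)
    singletons = sum(1 for c in counts if c == 1)
    return 1 - singletons / n


counts = [4, 3, 2, 1]  # hypothetical OTU counts for one sample
print([round(rarefy(counts, m), 2) for m in (1, 5, 10)])
print(good_coverage(counts))
```

Plotting `rarefy` against m gives a sample-size-based curve, and plotting it against coverage gives a coverage-based curve.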


Shannon diversity index


Inverse Simpson diversity index
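For reference, the Shannon and inverse Simpson indices named in these two headings can be sketched as follows. This is a minimal Python illustration, not the calculator used for the report. It assumes the proportion-based forms H' = -sum(p_i ln p_i) and 1/D = 1/sum(p_i^2); some tools use a finite-sample variant of Simpson's index instead. In a Hill-number framework the Shannon value is often reported as exp(H'), the effective number of species. The function names are hypothetical.

```python
from math import exp, log


def shannon(counts):
    """Shannon index H' = -sum(p_i * ln p_i) over OTUs with reads."""
    n = sum(counts)
    return -sum((c / n) * log(c / n) for c in counts if c > 0)


def inverse_simpson(counts):
    """Inverse Simpson index 1 / sum(p_i^2)."""
    n = sum(counts)
    return 1 / sum((c / n) ** 2 for c in counts)


even = [5, 5, 5, 5]     # hypothetical, perfectly even community
skewed = [17, 1, 1, 1]  # hypothetical, dominated by one OTU
for community in (even, skewed):
    h = shannon(community)
    print(round(h, 3), round(exp(h), 3), round(inverse_simpson(community), 3))
```

For a perfectly even community both exp(H') and the inverse Simpson index equal the number of OTUs; dominance by a few OTUs pulls both values down.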


Beta diversity


Heatmaps


Phyla


Class


Order


Family


Genus


PAM clustering


  • Partitioning Around Medoids (PAM)
  • Considered a more robust version of K-means.
  • Medoids are representative objects of the clusters.
  • Starts by determining the best number of clusters using factoextra::fviz_nbclust()
  • Method: Silhouette
  • Metric: Euclidean
  • Robust for partitioning a data set into clusters of observations.
  • Users need to know the data well enough to indicate the appropriate number of clusters to produce.
  • Visualize clusters (PAM results) using factoextra::fviz_cluster()
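The idea behind PAM can be sketched in plain Python. This is an illustration under stated assumptions, not the R implementation called in the report. It uses Euclidean distance, a greedy BUILD step, and a simplified SWAP step that accepts the first improving swap rather than searching for the best one. The `points` data and the function names are hypothetical.

```python
from math import dist


def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def pam(points, k):
    """Return (medoid indices, cluster label per point) for k clusters."""
    n = len(points)
    # BUILD: greedily add the point that lowers the total cost the most.
    medoids = []
    for _ in range(k):
        best = min((i for i in range(n) if i not in medoids),
                   key=lambda i: total_cost(points, medoids + [i]))
        medoids.append(best)
    # SWAP: replace a medoid by a non-medoid whenever the cost drops.
    cost = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                trial_cost = total_cost(points, trial)
                if trial_cost < cost - 1e-12:
                    medoids, cost, improved = trial, trial_cost, True
    labels = [min(range(k), key=lambda j: dist(p, points[medoids[j]]))
              for p in points]
    return medoids, labels


# Hypothetical 2-D observations forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, labels = pam(points, 2)
print(medoids, labels)
```

Because every medoid is an actual observation, a single extreme value shifts the cluster centre less than it would shift a K-means centroid.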


Number of best clusters


Level        Number_clusters   Value_Index
OTU-based                  2       22.3718
Phylum                     2      114.5654
Class                      2       73.9034
Order                      2       42.17
Family                     2       31.017
Genus                      2       24.511


Visualization of best clusters

Figure x: Optimal number of OTU clusters. The suggested number of best clusters (dotted line) that could explain most of the variation is 2 for OTUs (A), 3 for phylum (B), 3 for class (C), 2 for order (D), 10 for family (E) and 2 for genus (F). A high average silhouette width indicates high-quality clustering.


Cluster validation

  cluster size ave.sil.width
1       1  232          0.67
2       2  128          0.25
  cluster size ave.sil.width
1       1  240          0.68
2       2  120          0.24
  cluster size ave.sil.width
1       1  239          0.67
2       2  121          0.22
  cluster size ave.sil.width
1       1  239          0.67
2       2  121          0.22
  cluster size ave.sil.width
1       1  229          0.67
2       2  131          0.19
  cluster size ave.sil.width
1       1  241          0.68
2       2  119          0.29

Figure x: Silhouette plot guided by the best number of clusters. Observations with a large Si (almost 1) are very well clustered. A small Si (around 0) means that the observation lies between two clusters, while observations with a negative Si are probably placed in the wrong cluster.
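The silhouette width Si described in the caption can be computed as in the Python sketch below. It is an illustration only, not the code behind the figure. It assumes Euclidean distance, at least two clusters, and the common convention that an observation alone in its cluster gets Si = 0. The data and the function name are hypothetical.

```python
from math import dist


def silhouette_widths(points, labels):
    """Si = (b - a) / max(a, b) for each observation.

    a: mean distance to the other members of the observation's own cluster.
    b: smallest mean distance to the members of any other cluster.
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)  # singleton cluster (assumed convention)
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(sum(dist(points[i], points[j]) for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        widths.append((b - a) / max(a, b))
    return widths


# Hypothetical observations: two tight groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette_widths(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_widths(points, [0, 0, 1, 1, 1, 1])  # third point misplaced
print([round(s, 2) for s in good])
print([round(s, 2) for s in bad])
```

Averaging Si within a cluster gives a per-cluster figure comparable in kind to the ave.sil.width column shown above.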



Ordination projections


PCA (Principal Component Analysis)


  • Identifies a smaller number of uncorrelated variables (principal components) from a large set of data.
  • Explains the maximum amount of variance with the minimum number of principal components.
  • Missing values are replaced by the column mean.
  • A scree plot is used to estimate which components explain most of the variability in the data.
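The column-mean replacement of missing values mentioned in the list can be illustrated with a short Python sketch. It is not the code used for the report; it assumes missing entries are represented as `None` and that every column has at least one observed value. The function name and the `table` data are hypothetical.

```python
def impute_column_means(rows):
    """Replace None entries by the mean of the observed values in that column."""
    ncol = len(rows[0])
    means = []
    for j in range(ncol):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(ncol)]
            for r in rows]


# Hypothetical table with two missing entries.
table = [[1, None],
         [3, 4],
         [None, 8]]
print(impute_column_means(table))
```

Mean imputation leaves each column mean unchanged but shrinks the column variance, so heavily imputed columns contribute less to the leading components.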

Scree plot

Figure x: Scree plot of PCA, showing which components explain most of the variability in the data. Over 80% of the variance contained in the OTU and taxonomy data is retained by the first two principal components. The first PC explains the maximum amount of variation in the data set.
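The quantities read off a scree plot follow directly from the PCA eigenvalues, as the Python sketch below shows. It is an illustration only: the eigenvalues are hypothetical and are not taken from the report's data, and the function names are assumptions.

```python
def explained_variance(eigenvalues):
    """Return (proportion, cumulative proportion) per component, largest first."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    proportions = [ev / total for ev in ordered]
    cumulative, running = [], 0.0
    for p in proportions:
        running += p
        cumulative.append(running)
    return proportions, cumulative


def components_needed(eigenvalues, threshold=0.8):
    """Smallest number of leading components reaching the variance threshold."""
    _, cumulative = explained_variance(eigenvalues)
    for k, c in enumerate(cumulative, start=1):
        if c >= threshold:
            return k
    return len(cumulative)


# Hypothetical eigenvalues (illustration only).
eig = [6.0, 2.5, 1.0, 0.5]
print(explained_variance(eig))
print(components_needed(eig))
```

With these made-up eigenvalues the first two components together cross the 80% mark, which is the situation the caption describes.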


While PCA is based on Euclidean distances, PCoA is based on the (dis)similarity matrix calculated from OTU abundance data as described earlier. In any successful PCA or PCoA, the first few axes are expected to capture most of the variation in the input data. NMDS instead replaces the original distance data with ranks. Unlike in PCA and PCoA, the NMDS ordination axes are not ordered according to the variance they explain; instead, a plot of stress values (a measure of goodness of fit) against dimensionality can be used to choose the number of dimensions. Stress values >0.2 are generally considered hard to interpret, whereas values <0.1 are good and values <0.05 are better still. In any case, the inflexion point on scree plots and Shepard plots (stress plots) can guide the selection of the minimum number of dimensions to use when interpreting the multidimensional data.
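As an example of the kind of dissimilarity matrix that PCoA and NMDS start from, the Python sketch below computes Bray-Curtis dissimilarities between samples. Bray-Curtis is assumed here because it is the metric named in the NMDS figure caption; the code is an illustration, not the pipeline used for the report, and the `samples` data and function names are hypothetical.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i)."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator if denominator else 0.0


def dissimilarity_matrix(samples):
    """Symmetric matrix of pairwise Bray-Curtis dissimilarities."""
    return [[bray_curtis(a, b) for b in samples] for a in samples]


# Hypothetical OTU abundances for three samples.
samples = [[6, 4, 0],
           [4, 6, 0],
           [0, 0, 10]]
for row in dissimilarity_matrix(samples):
    print([round(v, 2) for v in row])
```

A value of 0 means identical composition and 1 means no shared OTUs. NMDS uses only the rank order of these values, whereas PCoA uses the values themselves.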


NMDS (Non-metric multidimensional scaling)


  • Goodness of fit and the Shepard plot can be used to judge whether the fit is good or poor. The sum of the squared goodness-of-fit values is equal to the squared stress. Large values indicate poor fit.
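The stress value behind a Shepard plot can be sketched as follows. This Python illustration assumes Kruskal's stress-1 and takes the fitted (monotone regression) distances as given rather than computing them. The labels returned by `describe_stress` follow the thresholds quoted in this report; the "intermediate" label for values between 0.1 and 0.2 is an assumption, since the text does not classify that range. All names and example numbers are hypothetical.

```python
from math import sqrt


def stress1(ordination_d, fitted_d):
    """Kruskal's stress-1 from ordination distances and fitted distances."""
    numerator = sum((d - f) ** 2 for d, f in zip(ordination_d, fitted_d))
    denominator = sum(d ** 2 for d in ordination_d)
    return sqrt(numerator / denominator)


def describe_stress(s):
    """Rule-of-thumb reading of a stress value."""
    if s < 0.05:
        return "better"
    if s < 0.1:
        return "good"
    if s > 0.2:
        return "hard to interpret"
    return "intermediate"  # assumed label; range not classified in the text


# Hypothetical distances for two pairs of samples.
s = stress1([3.0, 4.0], [3.0, 3.0])
print(round(s, 3), describe_stress(s))
```

Repeating the calculation for ordinations of increasing dimensionality gives the stress-versus-dimension plot used to choose the number of axes.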

Shepard plots: Goodness of fit and stress values


Ordination of sites (samples)


Ordination of OTUs and species at all levels

Figure x: Goodness-of-fit and stress plots for the Bray-Curtis dissimilarity calculated from OTU abundance.


Phylogenetic clustering and annotation


Figure x: Sample Phylip or Newick-formatted tree clustered using the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) algorithm. Similar data were used to construct different types of trees, including rectangular (A), circular (B) and unrooted (C), to view how samples were clustered.
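How UPGMA turns a distance matrix into a Newick string can be sketched in plain Python. This is an illustration under stated assumptions, not the tool used to build the trees in the figure. It takes a full symmetric distance matrix as input, and the sample names and distances are hypothetical.

```python
def upgma(names, dist):
    """Cluster samples by UPGMA and return the tree as a Newick string."""
    # Each cluster id maps to (newick text, number of leaves, node height).
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        (a, b), dab = min(d.items(), key=lambda kv: kv[1])
        text_a, size_a, height_a = clusters.pop(a)
        text_b, size_b, height_b = clusters.pop(b)
        height = dab / 2
        newick = (f"({text_a}:{height - height_a:g},"
                  f"{text_b}:{height - height_b:g})")
        # Distance to the merged cluster is the size-weighted average.
        new_d = {}
        for c in clusters:
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = (newick, size_a + size_b, height)
        next_id += 1
    return next(iter(clusters.values()))[0] + ";"


# Hypothetical distances between three samples.
names = ["A", "B", "C"]
dist = [[0, 2, 6],
        [2, 0, 6],
        [6, 6, 0]]
print(upgma(names, dist))
```

The resulting string can be loaded into any tree viewer that reads Newick; the choice of rectangular, circular or unrooted layout affects only the drawing.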



Possible questions


Alpha diversity

  • QN1: Are the values obtained too sensitive to sampling?
  • QN2: Was the sampling effort sufficient to account for most OTUs present in a sample?
  • QN3: Is there a need to continue with re-sampling?
  • QN4: …….?


Beta diversity

  • QN1: …….?
  • QN2: …….?
  • QN3: …….?
  • QN4: …….?


——————————————————–

More intervention by investigators

Oksanen, Jari, Guillaume Blanchet, Michael Friendly, Roeland Kindt, Pierre Legendre, Dan McGlinn, Peter R. Minchin, et al. 2018. “Vegan: Community Ecology Package.” R package version 2.5-2. https://cran.r-project.org/web/packages/vegan/vegan.pdf.